A Variant of K-Means Clustering through Heuristic Initial Seed Selection for Improved Clustering of Data
Abstract
Unsupervised clustering algorithms are used in many applications to group data according to relevant similarity metrics. K-Means is one of the most widely used clustering techniques owing to its simplicity, and many improvements and extensions have been proposed to improve its performance. Of the dimensions explored in this regard, such as mean computation, centroid representation, initial seed/cluster centre selection and similarity calculation methods, the choice of initial cluster centres is found to have a profound impact on the performance of the algorithm. Existing methods choose the cluster centres either randomly or based on heuristics such as the maximum distance property, the maximum probability of the squared distance, or points with the maximum number of points lying close to them. In this paper, a strategy for selecting relevant initial cluster centres for two-cluster grouping problems is proposed, based on measures that indicate the statistical distribution of the data, with a view to improving clustering accuracy. These measures include the minimum, maximum, median, mean and skew of the data. The algorithm is validated on synthetic datasets and on datasets from the UCI repository, viz. Balance, BloodDonate, Diabetes, Ionosphere, Parkinsons and Sonar. The proposed algorithm is compared with K-Means and its variants and is found to achieve better accuracy. An increase in accuracy of approximately 0.25%-18% is observed across the datasets.
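The abstract does not state the exact seeding rule, so the following is only an illustrative sketch of how two initial centres might be derived from the minimum, maximum, median, mean and skew of each feature and then passed to a plain two-cluster K-Means loop. The function names (`skewness`, `heuristic_seeds`, `two_means`), the skew threshold of 0.5 and the midpoint rule are all assumptions made for this example; they are not the paper's method.

```python
import statistics


def skewness(values):
    """Population skewness (third standardised moment); 0.0 for constant data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def heuristic_seeds(points, skew_threshold=0.5):
    """Return two seed points built from per-feature summary statistics.

    Assumed rule (hypothetical, not taken from the paper): for each feature,
    use the median as the central value when the feature is noticeably skewed,
    otherwise the mean; place one seed halfway between the minimum and the
    central value, and the other halfway between the central value and the
    maximum.
    """
    low, high = [], []
    for column in zip(*points):
        if abs(skewness(column)) > skew_threshold:
            centre = statistics.median(column)
        else:
            centre = statistics.fmean(column)
        low.append((min(column) + centre) / 2)
        high.append((centre + max(column)) / 2)
    return [low, high]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, seeds, max_iter=100):
    """Standard K-Means iterations for k = 2, started from the given seeds."""
    centres = [list(seed) for seed in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min((0, 1), key=lambda k: squared_distance(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centres[k] = [statistics.fmean(col) for col in zip(*members)]
    return labels, centres


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
            (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
    labels, centres = two_means(data, heuristic_seeds(data))
    print(labels, centres)
```

Because the seeds are computed deterministically from the data, repeated runs give the same clustering, unlike random initialisation. Whether any such rule improves accuracy on a given dataset would have to be checked empirically, as the paper does on the UCI datasets.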
Similar Resources
Data Clustring Using A New CGA(Chaotic-Generic Algorithm) Approach
Clustering is the process of dividing a set of input data into a number of subgroups. The members of each subgroup are similar to each other but different from members of other subgroups. The genetic algorithm has enjoyed many applications in clustering data. One of these applications is the clustering of images. The problem with the earlier methods used in clustering images was in selecting in...
Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm
Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar to each other in some sense. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposes an improved version of the K-Means algorithm, namely Persistent K...
Use of the Improved Frog-Leaping Algorithm in Data Clustering
Clustering is a well-known data mining technique in which data with similar properties are placed in the same category. The K-means algorithm is one of the simplest clustering algorithms, but it has the disadvantages of being sensitive to the initial values of the clusters and of converging to a local optimum. In recent years, several algorithms based on evolutionary algorithms have been proposed for cluster...
GROUND MOTION CLUSTERING BY A HYBRID K-MEANS AND COLLIDING BODIES OPTIMIZATION
The stochastic nature of earthquakes makes it challenging for engineers to choose which records to use in their analyses. Clustering is offered as a solution to this data mining problem, automatically distinguishing between ground motion records based on similarities in the corresponding seismic attributes. The present work formulates an optimization problem to seek the best clustering measures. In...
Publication date: 2016